User-defined functions / classes and library initialisations

The function below implements the following steps:
4. Data pre-processing:
B. Check for target balancing and fix it if found imbalanced.
5. Model training, testing and tuning:
A. Use any Supervised Learning technique to train a model.
E. Display and explain the classification report in detail.
6. Post Training and Conclusion:
A. Display and compare all the models designed with their train and test accuracies.

Project

DOMAIN: Semiconductor manufacturing process
• CONTEXT:
A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, as well as noise. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. The Process Engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process. This will enable an increase in process throughput, decreased time to learning, and reduced per-unit production costs. These signals can be used as features to predict the yield type, and by analysing and trying out different combinations of features, the essential signals that are impacting the yield type can be identified.
• DATA DESCRIPTION: sensor-data.csv : (1567, 592)
The data consists of 1567 datapoints, each with 591 features. The dataset presented in this case represents a selection of such features, where each example represents a single production entity with associated measured features, and the labels represent a simple pass/fail yield for in-house line testing. In the target column, "-1" corresponds to a pass and "1" corresponds to a fail, and the data timestamp is for that specific test point.
• PROJECT OBJECTIVE:
We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.

Steps and tasks:
1. Import and understand the data.
A. Import ‘signal-data.csv’ as DataFrame.
B. Print 5 point summary and share at least 2 observations.

Every column is numeric except for the Time column.
Later, let's see if we can extract features from the Time column; otherwise drop it.
The Time column also seems to have duplicates, which could be the case for the other columns too;
need to confirm before dropping those.

There are a few constant columns like "13", "42", ...
There are a few extremely skewed or quasi-constant columns like "4", "21", ...
There are a few near-perfect bell curves like "24".

We need to review and remove columns that don't add information about the target. With reference to the target, the dataset seems imbalanced, as more than 75% of the data corresponds to -1.

2. Data cleansing:
A. Write a for loop which will remove all the features with 20%+ Null values and impute rest with mean of the feature.
B. Identify and drop the features which are having same value for all the rows.

safe to continue without dropping any records

none found, hence let's proceed

----------------------------------------------------------------------------
Let us set a baseline model using DecisionTreeClassifier.
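A sketch of such a baseline, using synthetic imbalanced data in place of the sensor features (the dataset shape and class weights here are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# stand-in data; in the notebook X, y come from the cleaned sensor DataFrame
X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# an unpruned tree: fast to fit, but prone to memorising the training data
base = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", base.score(X_train, y_train))
print("test accuracy :", base.score(X_test, y_test))
print(classification_report(y_test, base.predict(X_test)))
```

The per-class precision/recall in the report is what exposes the weak FAIL-class performance that overall accuracy hides.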

Pretty impressive accuracy and low execution time,
but unfortunately the precision, recall and f1_score for the FAIL class (+1) are very poor.
They are poor in the training-data predictions, probably because of the imbalanced data;
in the test-data predictions they have fallen even lower, indicating an over-fit model.
Let's build on our modelling.

Before proceeding further, let's extract some timestamp features & inherent clusters.
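A minimal sketch of both extractions, assuming the frame still carries the raw Time column; the helper names, chosen time components, and cluster count are illustrative assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans

def add_time_features(df, time_col="Time"):
    # assumes the "Time" column parses with pandas' default datetime parser
    ts = pd.to_datetime(df[time_col])
    out = df.drop(columns=time_col).copy()
    out["hour"] = ts.dt.hour            # time-of-day component
    out["dayofweek"] = ts.dt.dayofweek  # weekly cycle component
    return out

def add_cluster_feature(df, n_clusters=3, seed=42):
    # KMeans label on the numeric features as a synthesised "inherent cluster" column
    out = df.copy()
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    out["cluster"] = km.fit_predict(out)
    return out
```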

The snippet below adds to the following step:
2. Data cleansing:
E. Make all relevant modifications on the data using both functional/logical reasoning/assumptions.

The function below implements the following steps:
4. Data pre-processing:
B. Check for target balancing and fix it if found imbalanced.
5. Model training, testing and tuning:
A. Use any Supervised Learning technique to train a model.
E. Display and explain the classification report in detail.
6. Post Training and Conclusion:
A. Display and compare all the models designed with their train and test accuracies.

The model performance has significantly improved in terms of the FAIL class,
probably owing to the combined effect of the feature additions, standardisation & target-class balancing.

2. Data cleansing:
C. Drop other features if required using relevant functional knowledge. Clearly justify the same.

The variances (or standard deviations) of several features are condensed below unity.
This indicates that several features would not contribute to model learning.
Though a z-score transformation will shift & rescale the distributions, it would also amplify the noise in the data during model learning.
Hence let us use a few feature-selection techniques to shrink our dataset.
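One way to do the quasi-constant trimming described next is scikit-learn's `VarianceThreshold`; the threshold value below is an illustrative assumption, not the one the notebook used:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def drop_quasi_constant(df, threshold=0.01):
    """Keep only features whose variance exceeds `threshold`
    (constant and quasi-constant columns are trimmed off)."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(df)
    return df.loc[:, selector.get_support()]
```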

149 quasi-constant features were trimmed off, leaving behind 306 features.

Though the test scores seem to have reduced, the number of features dropped is a good trade-off against it.
Let's study further.

SCFS (Standard deviation and Cosine similarity based Feature Selection)
Reference article for feature scoring
using custom method based on published paper from
https://www.frontiersin.org/articles/10.3389/fgene.2021.684100/full
Credits to: Juanying Xie, Mingzhao Wang, Shengquan Xu, Zhao Huang and Philip W. Grant

Explanation & justification for using the method:
The discernibility of a feature refers to its capability to distinguish between categories.
Feature selection aims to detect features whose distinguishing capability is strong while the redundancy between them is low.
To represent the redundancy between a feature and the other features, cosine similarity is used.
Feature independence is deduced from the cosine similarity (in 3 possible ways).
The method guarantees that a feature will have the maximal independence as far as possible once it has the maximal discernibility.
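A rough numpy sketch of how such a score could be composed: standard deviation as the discernibility term, mean absolute cosine similarity with the other features as the redundancy term, and one reciprocal-style independence variant. This is an interpretation for illustration; the paper's exact formulation differs in detail:

```python
import numpy as np

def scfs_scores(X):
    """SCFS-style score sketch: per-feature standard deviation (discernibility)
    weighted by a reciprocal independence term derived from the mean absolute
    cosine similarity with the other features."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)                      # discernibility term
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0                  # guard against zero columns
    Xn = X / norms
    cos = np.abs(Xn.T @ Xn)                  # |cosine similarity| between columns
    np.fill_diagonal(cos, 0.0)               # ignore self-similarity
    redundancy = cos.mean(axis=1)            # mean |cos| with the other features
    independence = 1.0 / (1.0 + redundancy)  # reciprocal independence variant
    return std * independence
```

Features can then be ranked by this score and cut at the elbow of the sorted-score curve.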

The feature-discernibility scale has overpowered the feature-independence scale, thus the above curve seems asymptotic to the axes.
Let us perform standardisation and then use the SCFS technique.

The standard scaler has changed every feature's discernibility (standard deviation) to unity, rendering no meaningful information.
Let us try MinMaxScaler.

The above feature-scoring plot seems meaningful, with an approximate elbow formed around a certain feature score.
Let us try the other 2 independence-score kinds using the same MinMaxScaler.

The reciprocal method gives a significantly reduced number of features.

Since the reciprocal method produces a better elbow and returns minimal features,
let's choose the reciprocal independence method with a threshold of 0 score on the log scale.

All low-scored features have been removed, leaving 34 features to go ahead with.

Though the scores seem to have reduced, the divide between the training and testing scores has greatly narrowed, indicating a significant reduction in data noise.
This justifies the power of the SCFS methodology.

Let us study the feature importances from the DTree classifier.
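A sketch of how the tree's importances can be extracted and ranked (printed here rather than plotted); the synthetic data stands in for the SCFS-trimmed training set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# stand-in data; in the notebook the SCFS-trimmed training set is used
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# importances sum to 1; rank features from most to least influential
order = np.argsort(tree.feature_importances_)[::-1]
for i in order:
    print(f"feature {i}: {tree.feature_importances_[i]:.3f}")
```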

The above feature-importance plot shows a gradual increase/decrease of influence from each successive feature.
We cannot justify dropping any further features based on the above, hence let's move on.

-----------------------------------------------------------------------------------------------
By now, the following project statements have been covered in various sections; they are listed here to keep track:

  1. Import and understand the data.
    A. Import ‘signal-data.csv’ as DataFrame.
    B. Print 5 point summary and share at least 2 observations.
  2. Data cleansing:
    A. Write a for loop which will remove all the features with 20%+ Null values and impute rest with mean of the feature.
    B. Identify and drop the features which are having same value for all the rows.
    C. Drop other features if required using relevant functional knowledge. Clearly justify the same.
    (quasi constants removed, SCFS method applied)
    E. Make all relevant modifications on the data using both functional/logical reasoning/assumptions.
    (timestamp features extracted, inherent clusters identified)
  3. Data pre-processing:
    A. Segregate predictors vs target attributes.
    B. Check for target balancing and fix it if found imbalanced.
    (implemented inside the pipe method)
    C. Perform train-test split and standardise the data or vice versa if required.
    (implemented inside the pipe method)
  4. Model training, testing and tuning:
    A. Use any Supervised Learning technique to train a model.
    (implemented as a pipe)

2. Data cleansing:
D. Check for multi-collinearity in the data and take necessary action.

There were 491 multicollinear pairs within just 306 features after quasi-constant feature elimination.

There are no cases of high correlation, since SCFS has already taken the dependence between features into consideration for feature scoring.

It can be seen that the maximum correlation is 0.7 (or -0.6), hence there is not much multicollinearity, except for two or three pairs.

Let's investigate further using Variance Inflation Factors.
By definition, the variance inflation factor is a measure of the increase in the variance of the parameter estimates if an additional variable, given by exog_idx, is added to the linear regression. It is a measure of multicollinearity of the design matrix, exog.
One recommendation is that if the VIF is greater than 5, then the explanatory variable given by exog_idx is highly collinear with the other explanatory variables, and the parameter estimates will have large standard errors because of this.
Hence features having a VIF above 5 need to be studied for dropping.
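The usual tool for this is statsmodels' `variance_inflation_factor`; to make the definition above concrete, here is an equivalent plain-numpy sketch computing VIF_i = 1 / (1 - R²) from regressing feature i on the remaining features:

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF per feature via least squares: regress each feature on the
    others (with an intercept) and invert 1 - R^2."""
    X = df.to_numpy(dtype=float)
    n, p = X.shape
    vifs = {}
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs[df.columns[i]] = 1.0 / max(1.0 - r2, 1e-12)  # guard perfect collinearity
    return pd.Series(vifs, name="VIF").sort_values(ascending=False)
```

Features whose VIF exceeds the chosen cutoff (5 here) become candidates for dropping.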

The recall scores have improved, with a significant drop in the number of features.
Going further, let us test the models with both the SCFS-trimmed & VIF-trimmed data.

2. Data cleansing:
E. Make all relevant modifications on the data using both functional/logical reasoning/assumptions.

Skew correction has decreased the scores for the VIF-trimmed dataset,
whereas for the SCFS-trimmed dataset there is no change in scores.

Apart from the above skew correction, timestamp feature extraction and cluster feature extraction were performed earlier in the notebook.

There are no duplicate records.

As expected, the SCFS and VIF methods would have removed any duplicated features on account of their similarity & collinearity.

-----------------------------------------------------------------------------------------------------------

3. Data analysis & visualisation:
A. Perform a detailed univariate Analysis with appropriate detailed comments after each analysis.
B. Perform bivariate and multivariate analysis with appropriate detailed comments after each analysis.

The data follows a normal distribution, but with several extreme values on either side. The means of the pass & fail classes are not far apart compared to the range of the data.

follows bell curve
difference in target class means found

does not follow normal distribution
twin peaks found, one near 0, and another near 400
difference in target class means found

does not follow normal distribution
twin peaks found, one near 0, and another near 400
difference in target class means found, but in the opposite direction from the previous feature

The above two features exhibit similar distributions,
yet the SCFS & VIF methods have confirmed that they are not related.

a skewed bell curve found
central tendencies are close for target classes

twin peak found

heavy peak found close to zero, causing a highly skewed distribution

The above two features exhibit similar distributions,
yet the SCFS & VIF methods have confirmed that they are not related.

close to uniform distribution

The above two features exhibit a near-uniform distribution.

The last two features are synthesised features from the inherent data clusters;
they don't follow any standard distribution.

Let's plot a bivariate pair plot and study further.

From the above plot one may not be able to decipher any relations, as the datapoints are clouded all over their space.
Let us study further.

-----------------------------------------------------------------------------------------------------------

4. Data pre-processing:
D. Check if the train and test data have similar statistical characteristics when compared with original data.

X_train_SCFS, X_test_SCFS, Y_train, Y_test

The target class is almost equally distributed in the training & testing datasets.

Most of the statistical indices of the features are closely similar in the training & testing datasets.
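A sketch of how this comparison can be made, with a synthetic frame standing in for the X_train_SCFS / X_test_SCFS split referenced above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# stand-in frame; in the notebook X_train_SCFS / X_test_SCFS are compared
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(rng.integers(0, 2, size=500))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# side-by-side mean/std of train vs test for each feature
summary = pd.concat(
    {"train": X_tr.describe().loc[["mean", "std"]],
     "test": X_te.describe().loc[["mean", "std"]]},
    axis=1,
)
print(summary)
print("train class balance:", y_tr.mean(), "test:", y_te.mean())
```

Stratified splitting is what keeps the class balance nearly identical across the two sets.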

-----------------------------------------------------------------------------------------------------------

5. Model training, testing and tuning:
A. Use any Supervised Learning technique to train a model.
B. Use cross validation techniques.
Hint: Use all CV techniques that you have learnt in the course.
C. Apply hyper-parameter tuning techniques to get the best accuracy.
Suggestion: Use all possible hyper parameter combinations to extract the best accuracies.
D. Use any other technique/method which can enhance the model performance.
Hint: Dimensionality reduction, attribute removal, standardisation/normalisation, target balancing etc.
E. Display and explain the classification report in detail.
F. Apply the above steps for all possible models that you have learnt so far.

# reference datasets:
# X_train_vif, X_test_vif, Y_train, Y_test
# X_train_SCFS, X_test_SCFS, Y_train, Y_test

LOOCV provides a caution about the widened confidence interval,
yet consumes more compute time.
For the upcoming models, let's stick to KFold.
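The trade-off can be seen side by side; the tiny synthetic dataset below stands in for the real one so the LOOCV loop stays cheap:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, n_features=8, random_state=1)
clf = DecisionTreeClassifier(random_state=1)

kf_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=1))
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())  # 150 fits: far more compute

print(f"KFold : {kf_scores.mean():.3f} +/- {kf_scores.std():.3f}")
print(f"LOOCV : {loo_scores.mean():.3f} +/- {loo_scores.std():.3f}")
```

Each LOOCV fold scores either 0 or 1, which is why its per-fold spread is so much wider than KFold's.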

lower than the train-data accuracy

Test accuracy not improved

Test accuracy dropped lower;
we need more trials.
Let's build a custom pipeline.
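A minimal sklearn-only sketch of such a pipeline with hyper-parameter tuning; `class_weight="balanced"` stands in for whatever resampling the notebook's pipe performs, and the grid values, estimator, and synthetic data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# stand-in imbalanced data; the notebook feeds the SCFS/VIF-trimmed sets instead
X, y = make_classification(n_samples=600, n_features=15,
                           weights=[0.85, 0.15], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

# scaling + class weighting inside the pipeline so tuning sees no test leakage
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="recall")
grid.fit(X_tr, y_tr)
print("best params:", grid.best_params_)
print("held-out recall:", grid.score(X_te, y_te))
```

Scoring on recall rather than accuracy keeps the tuning focused on the weak FAIL class.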

The best model was XGBClassifier, owing to its excellent boosted trees.

The accuracy has come to 91%, but the test recall has been very poor, due to overfitting.

-----------------------------------------------------------------------------------------------------------